Abstract

When using model checking to verify programs in prac- 
tice, it is not usually possible to achieve complete cov- 
erage of the system. In this position paper we describe 
ongoing research within the Automated Software Engi- 
neering group at NASA Ames on the use of test cov- 
erage metrics to measure partial coverage and provide 
heuristic guidance for program model checking. We 
are specifically interested in applying and developing 
coverage metrics for concurrent programs that might be 
used to support certification of next generation avionics 
software. 

1 Introduction 

Model checking and testing are conceptually close 
neighbors, because they both operate over executable 
system models. In practice, model checking is mostly
used to analyze high-level requirements and design
models of a system, whereas testing is predominantly
used for the analysis of implementations. The most
common link between these two techniques suggested in
the literature has been to use model checking for
test-case generation []. Recent advances in applying
model checking to real programming languages
[5, 8, 1, 10] have, however, opened up some interesting
new ways in which testing and model checking can be
used in tandem.

Model checking is often claimed to be ’better’ than test- 
ing since all possible behaviors of a system are ana- 
lyzed - the implication being that model checking might 
catch subtle errors that testing might miss. While this 
is true in theory, real systems tend to have very large 
state-spaces (more so than designs in general). To re- 
duce the size of the state-spaces that must be searched, 
model checkers for programming languages typically 
use abstraction techniques. However, effective
abstraction often requires expert user input, and as
such it is not currently a solution that will find wide
industrial appeal.


In case studies involving real systems, we have found 
that if an error exists it is often quite obvious (in
hindsight) and one can make it appear by considering
only a few subtle interactions rather than a multitude
of complex ones (a.k.a. the low-hanging-fruit
principle). Hence,
only looking at part of the state-space (or behaviors) 
can be very effective for finding errors when using a 
model checker to analyze a real program. Therefore, 
model checking programs is very similar to program 
testing: neither technique scales well to high levels of 
behavioral coverage for real systems and both can be 
effective at finding errors by examining a subset of pro- 
gram behaviors. 

One notable difference between testing and model 
checking is that model checking is more suited to anal- 
ysis of concurrent and reactive programs because it has 
full control over the scheduling of processes, threads 
and events, which is not typically the case in test- 
ing. Also, because a model checker stores the pro- 
gram states that it has visited, it can be more efficient 
in covering the behaviors of a program [2]. In
addition, abstraction frameworks, such as abstract
interpretation, can provide methods for constructing
(conservative) over-approximations, and model checking
search optimizations can provide (conservative)
under-approximations, both of which can be used to
battle state-space explosion while maintaining the
’full coverage’ or verification capabilities of the
models.

In the following sections we present several ways in 
which program model checking can be improved by 
taking advantage of the close relationship to testing. 
In Section 2 we discuss structural coverage measures 
from testing that can measure partial coverage by model 
checking, and how these measures can be used to guide 
a model checker’s execution. In Section 3 we show 
results obtained from extending the Java PathFinder 
model checker with the capability to calculate branch
coverage as well as use this coverage for guided search.
Section 4 contains a short summary of the work
presented and how we hope to proceed with this research.

2 Coverage for Model Checking

During testing it is common to use structural code
coverage measures, such as decision (or branch)
coverage, to obtain confidence that a program has been
adequately tested. Coverage metrics include statement
coverage, decision (or branch) coverage and modified
condition/decision coverage (MC/DC). For example, the
FAA requires software testing to achieve 100% MC/DC in
order to certify level A criticality software for
flight [7]. MC/DC requires that each boolean condition
within a decision be shown to independently affect the
outcome of the decision.
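To make the MC/DC requirement concrete, the following minimal sketch checks whether a set of test vectors shows each condition independently affecting a decision. The class name McdcCheck, the example decision a && (b || c) and the test vectors used with it are illustrative assumptions, not taken from any certification standard or tool. The sketch uses the simple "unique cause" reading, in which two tests differ in exactly one condition and yield different outcomes.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper: checks unique-cause MC/DC for a decision over boolean conditions. */
class McdcCheck {

    /**
     * Returns true if, for every condition index, the test set contains two
     * vectors that differ only in that condition and give different outcomes.
     */
    static boolean achievesMcdc(Predicate<boolean[]> decision, int conditions,
                                List<boolean[]> tests) {
        for (int c = 0; c < conditions; c++) {
            if (!hasIndependencePair(decision, c, tests)) {
                return false;
            }
        }
        return true;
    }

    /** Looks for a pair of tests demonstrating that condition c alone flips the outcome. */
    static boolean hasIndependencePair(Predicate<boolean[]> decision, int c,
                                       List<boolean[]> tests) {
        for (boolean[] t1 : tests) {
            for (boolean[] t2 : tests) {
                if (differsOnlyIn(t1, t2, c) && decision.test(t1) != decision.test(t2)) {
                    return true;
                }
            }
        }
        return false;
    }

    /** True if the two vectors differ at index c and agree everywhere else. */
    static boolean differsOnlyIn(boolean[] t1, boolean[] t2, int c) {
        for (int i = 0; i < t1.length; i++) {
            boolean same = t1[i] == t2[i];
            if (i == c ? same : !same) {
                return false;
            }
        }
        return true;
    }

    /** Example decision, assumed for illustration: a && (b || c). */
    static boolean exampleDecision(boolean[] v) {
        return v[0] && (v[1] || v[2]);
    }
}
```

For the assumed decision, the four vectors (T,T,F), (F,T,F), (T,F,F) and (T,F,T) satisfy the check, whereas the two vectors (T,T,F) and (F,T,F) exercise both outcomes of the decision but do not show b or c acting independently.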

We are currently investigating whether similar
coverage measures can be used when analyzing only a
part of the state-space of a program during model
checking. The simple answer is yes: the model checker
now either reports that an error was found, together
with a path that shows how to reach it, or, if no error
is found, returns a coverage measure that testing
engineers can interpret as is done now. The real
question is why this would be useful. A model checker
that can calculate traditional structural coverage
could help answer at least one interesting question
that many people, including the FAA, have been
struggling with for some time: how good is the MC/DC
coverage measure at finding safety-critical errors [3]?
It is not clear, however, that a partial model checking
result that includes 100% MC/DC coverage provides much
of a guarantee of error-free operation. Note that
achieving a certain structural coverage is not known to
be useful in finding certain types of behavioral
errors, such as timing/scheduling errors, i.e. exactly
the ones model checking is good at finding.

Model checkers are particularly suited to finding errors 
in concurrent programs, but many traditional coverage 
criteria are essentially only meaningful for sequential 
programs. So, although it would be interesting to see 
how these measures work in the concurrent context, it 
may be more appropriate to investigate whether there 
are coverage measures more suitable to model
checking. For example, the concurrent structural
coverage measures from the testing literature, such as
all-du-paths [9, 11], may be appropriate. A more
interesting approach may be to develop suitable
behavioral coverage measures. For example, “relevant
path coverage” might be used to indicate coverage of
the paths relevant to proving a property correct. Using
behavior-based coverage metrics, it should be clear
that program abstraction techniques, such as slicing
and conservative abstraction, still provide full
coverage even though some paths are not (completely)
checked.

It is rather straightforward to argue that a model
checker, during its normal operation, can calculate a 
coverage measure that can be used to evaluate how well 
the model checker did with respect to established test- 
ing measures. But, could these measures actually be 
used to improve model checking? For example, it is 
possible to guide the model checker to pick parts of the 
state-space to analyze based on structural coverage of 
the code that would generate those state-spaces. A
simple example would be to consider only statement
coverage: if the model checker can next analyze either
a program statement that has never been executed or one
that has, it picks the new one.
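The statement-coverage heuristic just described can be sketched as follows. This is only an illustration under stated assumptions: the class CoverageGuidedChooser and the use of integer statement ids are hypothetical and do not correspond to any model checker's actual interface.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: prefer statements that have not been executed yet. */
class CoverageGuidedChooser {
    // Ids of all statements executed so far.
    private final Set<Integer> executed = new HashSet<>();

    /**
     * Picks the next statement id from the enabled candidates: the first one
     * never executed before, or the first candidate if all were already seen.
     */
    int choose(List<Integer> enabled) {
        if (enabled.isEmpty()) {
            throw new IllegalArgumentException("no enabled statements");
        }
        int pick = enabled.get(0);
        for (int stmt : enabled) {
            if (!executed.contains(stmt)) {
                pick = stmt;
                break;
            }
        }
        executed.add(pick);
        return pick;
    }

    /** Statement coverage so far, as a fraction of totalStatements. */
    double coverage(int totalStatements) {
        return (double) executed.size() / totalStatements;
    }
}
```

Note that this sketch never discards a candidate; it only changes which one is tried first, so a search built on it could still backtrack to the other choices.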

3 Experiments with Java PathFinder 

We have been experimenting with some of these ideas
within the Java PathFinder (JPF) model checker¹
developed at NASA Ames. JPF is a model checker for Java
programs that analyzes a program by executing the
underlying bytecode instructions in a depth-first
fashion. We modified the system such that it records
the number of times the true and false branches of each
branching instruction are taken. On the level of the
bytecodes this amounts to decision (or branch)
coverage: if a branch was taken more than 0 times then
that branch of the decision was covered. The model
checker displays its current coverage during execution.
It quickly became apparent that this measure is quite
useful to show progress within the model checker: the
coverage would converge after some time had passed, and
after that would increase sporadically, indicating that
some new code or behaviors were being executed.
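The bookkeeping described above can be sketched roughly as follows. The class BranchCoverage and the string ids used for branching instructions are assumptions made for illustration; this is not JPF's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical recorder for bytecode-level branch (decision) coverage. */
class BranchCoverage {
    // For each branching instruction id: [times false taken, times true taken].
    private final Map<String, int[]> counts = new LinkedHashMap<>();

    /** Registers a branching instruction so that untaken branches count as uncovered. */
    void register(String branchId) {
        counts.putIfAbsent(branchId, new int[2]);
    }

    /** Records that the given outcome of a branching instruction was taken. */
    void record(String branchId, boolean outcome) {
        register(branchId);
        counts.get(branchId)[outcome ? 1 : 0]++;
    }

    /** Number of branch outcomes taken more than 0 times. */
    int covered() {
        int n = 0;
        for (int[] c : counts.values()) {
            if (c[0] > 0) n++;
            if (c[1] > 0) n++;
        }
        return n;
    }

    /** Total number of branch outcomes (two per registered instruction). */
    int total() {
        return 2 * counts.size();
    }
}
```

Reporting covered() against total() at intervals during the search would give the kind of progress indicator described in the text.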

But most of our experiments were done while analyzing
multi-threaded Java programs, and we noticed that
certain threads might achieve very low coverage
although the overall coverage measure indicates high
coverage for the complete program. We thus adapted the
decision coverage to indicate coverage for each
dynamically created thread in the program. Although
this seemed to be the most obvious way to extend
sequential coverage to concurrent coverage, we could
not find any prior literature and hence believe this is
a novel approach. Although this thread-based coverage
is very useful to see progress within the model
checker, it is conservative when used as a coverage
measure. The problem is that certain threads cannot
execute certain parts of the code, and hence low
coverage is obtained even in the extreme case where the
model checker can cover all behaviors of the program.
We believe this problem can easily be overcome by doing
static analysis on the program before model checking to
calculate more precisely what a thread’s potential
coverage should be.

¹ Available from http://ase.arc.nasa.gov/jpf

To illustrate some of these concepts we will use the fol- 
lowing Java program that contains a deadlock. 

    class Event {
        int count = 0;

        public synchronized void wait_for_event() {
            try { wait(); } catch (InterruptedException e) {}
        }

        public synchronized void signal_event() {
            count = count + 1;
            notifyAll();
        }
    }

    class FirstTask extends Thread {
        Event event1, event2;
        int count = 0;

        public FirstTask(Event e1, Event e2) {
            this.event1 = e1; this.event2 = e2;
        }

        public void run() {
            count = event1.count;
            while (true) {
                // Wait only if event1 has not been signalled since the last check.
                if (count == event1.count)
                    event1.wait_for_event();
                count = event1.count;
                event2.signal_event();
            }
        }
    }

    class SecondTask extends Thread {
        Event event1, event2;
        int count = 0;

        public SecondTask(Event e1, Event e2) {
            this.event1 = e1; this.event2 = e2;
        }

        public void run() {
            count = event2.count;
            while (true) {
                event1.signal_event();
                // Wait only if event2 has not been signalled since the last check.
                if (count == event2.count) {
                    event2.wait_for_event();
                }
                count = event2.count;
            }
        }
    }

    class Main {
        public static void main(String[] args) {
            Event event1 = new Event();
            Event event2 = new Event();
            FirstTask task1 = new FirstTask(event1, event2);
            SecondTask task2 = new SecondTask(event1, event2);
            task1.start(); task2.start();
        }
    }


JPF cannot find the deadlock in this program before
exhausting memory, because it searches in a depth-first
fashion and the count variable in the Event class is
incremented indefinitely, hence creating a unique state
every time. There are three threads in the program:
Main, FirstTask and SecondTask. Both FirstTask and
SecondTask have one decision point and hence two
possible branches (4 branches in total). During
execution the coverage soon converges to 0 out of 4 for
Main and 1 out of 4 for each of the other two threads.
But clearly Main does not even have any branching
points, hence it in fact has full coverage, whereas the
other two have 50% coverage each.
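The difference between the reported figures (0 out of 4 and 1 out of 4) and the adjusted ones (full and 50%) comes down to which denominator is used. The sketch below illustrates this under stated assumptions: the class ThreadCoverage, the branch labels, and the sets of reachable branches per thread are hypothetical, with the reachable sets standing in for the result of a static analysis.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-thread branch coverage with statically computed potential. */
class ThreadCoverage {
    private final Map<String, Set<String>> taken = new HashMap<>();
    private final Map<String, Set<String>> reachable = new HashMap<>();

    /** Declares the branches a thread can possibly take (assumed static-analysis result). */
    void setReachable(String thread, Set<String> branches) {
        reachable.put(thread, branches);
        taken.putIfAbsent(thread, new HashSet<>());
    }

    /** Records that a thread took a branch, e.g. "FirstTask.if:true". */
    void record(String thread, String branch) {
        taken.computeIfAbsent(thread, t -> new HashSet<>()).add(branch);
    }

    /** Coverage measured against every branch in the program. */
    double raw(String thread, int programBranches) {
        return (double) takenCount(thread) / programBranches;
    }

    /** Coverage measured against the branches the thread can reach; 1.0 if there are none. */
    double adjusted(String thread) {
        int potential = reachable.getOrDefault(thread, new HashSet<>()).size();
        return potential == 0 ? 1.0 : (double) takenCount(thread) / potential;
    }

    private int takenCount(String thread) {
        return taken.getOrDefault(thread, new HashSet<>()).size();
    }
}
```

With one branch recorded for each task and two reachable branches each, raw coverage against the 4 program branches is 0.25 per task and 0.0 for Main, while the adjusted values are 0.5 per task and 1.0 for Main, matching the figures in the text.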

To achieve full coverage during testing of this
example is hard since it is dependent on the scheduler.
A model checker can, however, do it by trying all
interleavings. When the JPF feature to limit the length
of depth-first paths is used, the error is discovered
within seconds. In terms of coverage, FirstTask still
has 50% coverage whereas SecondTask has 100% coverage
when the error is found. This shows another
unanticipated feature of the coverage measure: even
when an error is found it might produce interesting
results. Note that in this case one thread is still not
fully covered, and closer inspection indicates that
although the deadlock occurred due to a race condition
that manifested itself in SecondTask, the same thing
could also have happened in the other thread.
Admittedly, this is a trivial example, and this kind of
information may be harder to interpret in real-sized
examples.
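Why a depth limit helps here can be seen on a toy state space. The sketch below is not JPF's search and does not model the example program; the class BoundedDfs, its State type and its successor function are assumptions chosen so that the first successor always increments a counter (yielding a new state every time, as the count variable does), while the second successor leads to the error.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical toy search: depth-limited DFS over an infinite state space. */
class BoundedDfs {
    /** Toy state: a counter that can grow forever, plus a flag marking the error. */
    static final class State {
        final int count;
        final boolean deadlocked;

        State(int count, boolean deadlocked) {
            this.count = count;
            this.deadlocked = deadlocked;
        }

        @Override public boolean equals(Object o) {
            return o instanceof State
                    && ((State) o).count == count
                    && ((State) o).deadlocked == deadlocked;
        }

        @Override public int hashCode() {
            return 2 * count + (deadlocked ? 1 : 0);
        }
    }

    /** Successors: first keep incrementing (always a new state), then the erroneous step. */
    static List<State> successors(State s) {
        List<State> next = new ArrayList<>();
        if (!s.deadlocked) {
            next.add(new State(s.count + 1, false));
            next.add(new State(s.count, true));
        }
        return next;
    }

    /** Returns the depth at which a deadlocked state is found, or -1 if none within maxDepth. */
    static int search(State s, int depth, int maxDepth, Set<State> visited) {
        if (s.deadlocked) return depth;
        // Stop at the depth limit, or if this state was already expanded.
        if (depth == maxDepth || !visited.add(s)) return -1;
        for (State n : successors(s)) {
            int found = search(n, depth + 1, maxDepth, visited);
            if (found >= 0) return found;
        }
        return -1;
    }
}
```

Without the limit, this search would follow the incrementing successor forever, since stored states never repeat. With a limit of 5, it backtracks from the counter chain and reaches the deadlocked state at depth 5.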

In order to experiment with the coverage-guided model
checking idea we adapted the JPF scheduler to ignore
threads in which the next statement is a branching
statement whose branch has already been taken n times
before. For example, with n = 10, if the true branch is
to be taken but it has already been taken 10 times
before, then the scheduler ignores that thread and
schedules another one instead, the hope being that some
other interleaving of statements will cause the false
branch to be taken at some later point. This required a
trivial change to the JPF scheduler.
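A minimal sketch of this filtering rule is given below. It is an illustration only: the class BranchLimitScheduler, the Pending record of what each thread would do next, and the branch labels are hypothetical and are not part of JPF.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical scheduler filter: skip threads about to repeat an over-used branch. */
class BranchLimitScheduler {
    /** What a thread would do next; branchOutcome is null for non-branching statements. */
    static final class Pending {
        final String thread;
        final String branchOutcome;

        Pending(String thread, String branchOutcome) {
            this.thread = thread;
            this.branchOutcome = branchOutcome;
        }
    }

    private final int limit;                                    // the n of the text
    private final Map<String, Integer> taken = new HashMap<>(); // times each outcome was taken

    BranchLimitScheduler(int limit) {
        this.limit = limit;
    }

    /**
     * Returns the first runnable thread whose next branch outcome was taken
     * fewer than limit times (or whose next statement is not a branch),
     * or null if every runnable thread is ignored.
     */
    String schedule(List<Pending> runnable) {
        for (Pending p : runnable) {
            if (p.branchOutcome == null) {
                return p.thread;
            }
            int count = taken.getOrDefault(p.branchOutcome, 0);
            if (count < limit) {
                taken.put(p.branchOutcome, count + 1);
                return p.thread;
            }
        }
        return null; // all candidates ignored: this part of the state-space is pruned
    }
}
```

The null result makes the cost of the rule visible: once every runnable thread is about to repeat an exhausted branch, nothing is scheduled and the remaining behaviors along that path are not explored.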

If we now analyze the example program above with this
special branch-scheduling feature enabled (and with no
depth limit) the error is found instantly (even with n
as small as 1!). This is an extremely encouraging
result, but the solution adopted is rather simplistic
and may ignore too much of the state-space. A more
general approach would be to order the statements to be
executed next in the model checker according to whether
or not they will improve coverage (rather than just
ignoring statements that will not improve coverage). We
plan to extend the JPF scheduler such that user-defined
cost functions can be added before model checking to
rank transitions for execution. This will allow not
just coverage measures to influence the ranking, but
also other heuristics such as the shortest path to a
blocking statement (which might improve deadlock
detection).
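One way such a ranking might look is sketched below. Everything here is an assumption for illustration: the class TransitionRanker, the fields of Transition, and the example cost function are hypothetical and do not describe a planned JPF interface.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToIntFunction;

/** Hypothetical ranking of candidate transitions by a user-defined cost function. */
class TransitionRanker {
    /** A candidate transition, described only by the features the example heuristics need. */
    static final class Transition {
        final String name;
        final boolean coversNewBranch;
        final int distanceToBlocking;

        Transition(String name, boolean coversNewBranch, int distanceToBlocking) {
            this.name = name;
            this.coversNewBranch = coversNewBranch;
            this.distanceToBlocking = distanceToBlocking;
        }
    }

    /** Returns the transitions ordered by ascending cost; nothing is discarded. */
    static List<Transition> rank(List<Transition> candidates, ToIntFunction<Transition> cost) {
        List<Transition> ordered = new ArrayList<>(candidates);
        ordered.sort(Comparator.comparingInt(cost));
        return ordered;
    }

    /** Example cost: prefer transitions that add coverage, then shorter paths to a blocking statement. */
    static int exampleCost(Transition t) {
        return (t.coversNewBranch ? 0 : 1000) + t.distanceToBlocking;
    }
}
```

Unlike the filtering rule, ranking keeps every transition available, so a transition that does not improve coverage is merely tried later rather than ignored.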

4 Conclusions 

We have discussed the use of coverage measures for 
program model checking, where in practice analyz- 
ing all possible behaviors is not tractable. We have 
shown with a simple example that decision coverage for
model checking can not only show how well the model
checker is doing, but also be used to guide a model
checker.

The next step is to see how well these techniques work
on industrial-size problems. We are fortunate to have a
Java version of the DEOS kernel from Honeywell that
contains a subtle error. We, and others, have shown
that this error can be discovered with model checking
[6, 4]. We would now like to see how the coverage
changes in cases where this error is found, and when it
is not. This analysis of DEOS is ongoing and we hope to
present preliminary results at the workshop.